Welcome everybody to deep learning. So today we want to continue talking about the different
losses and optimization. We want to go ahead and talk a bit about the details of these
interesting problems. Let's first talk about loss functions. Loss functions are generally task-specific: for different tasks you use different loss functions. The two most important tasks that we are facing are regression and classification. In classification
you want to estimate a discrete variable for every input. This means that you want to essentially
decide in this two class problem here on the left whether it's blue or red dots. So you
need to model a decision boundary. In regression the idea is that you want to model a function
that explains your data. So you have some input variable, let's say x2, and you want to predict x1 from it. To do so you compute a function that will produce appropriate values of x1 for any given x2. Here you can see this is a line fit. We talked about activation
functions, with the softmax as the last activation, and the cross-entropy loss. We combined them, and obviously there is a difference between the last activation function in our network and the loss function. The last activation function is applied to the individual samples x of
the batch. It will also be present at training and testing time. So the last activation function
will become part of the network and will remain there to produce the output or the prediction.
It generally produces a vector. Now the loss function combines all m samples and labels.
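As a small illustration (not from the lecture; a NumPy sketch under the usual definitions), the softmax maps each sample's logits to a probability vector, while the cross-entropy loss reduces the whole batch to a single scalar:

```python
import numpy as np

def softmax(z):
    # Last activation: applied per sample, returns a probability vector.
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Loss: combines all m samples and labels into one scalar.
    m = labels.shape[0]
    return -np.log(probs[np.arange(m), labels]).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2,  0.3]])  # m = 2 samples, 3 classes
y = np.array([0, 1])                   # ground-truth class indices

p = softmax(logits)          # shape (2, 3): one probability vector per sample
loss = cross_entropy(p, y)   # a single scalar describing how good the fit is
```

Note that the softmax stays in the network at test time to produce predictions, whereas the loss is only evaluated during training.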
In their combination they produce a loss that describes how good the fit is. The loss is generally a scalar value, and it is only needed during training time. Interestingly many of those
loss functions can be put in a probabilistic framework. This leads us to the maximum likelihood
estimation. In maximum likelihood estimation, just as a reminder, we consider everything
to be probabilistic. So we have a set of observations, capital X, that consists of individual observations.
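In symbols, the setup is the following (the notation here is my own, following common convention):

```latex
X = \{x_1, \dots, x_m\}, \qquad
Y = \{y_1, \dots, y_m\}, \qquad
p(y \mid x) \ \text{relates labels to observations.}
```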
Then we have associated labels. They also stem from some distribution, and these labels are denoted as y. Of course we need a conditional probability density function that describes how y and x are related. In particular we can compute the probability for y given
some observation x. This will be very useful for example if you want to decide on a specific
class. Now we have to somehow model this data set. The samples are drawn from some distribution, and the joint probability for the given data set can then be computed as a product over the individual conditional probabilities. If the samples are independent and identically distributed, you can simply write this as one large product over the entire training data set: a product over all m samples of the conditionals. This is useful because we can determine the best parameters by maximizing
the joint probability over the entire training data set. We have to do it by evaluating this
large product. Now this large product has a couple of problems. In particular, if we multiply many small and large values, the product can underflow or overflow very quickly. So it is useful to transform the entire problem into the logarithmic domain. Because the logarithm is a monotonic transformation, it doesn't change the position of the maximum. Hence we can use the log function and a negative sign to flip the maximization into a minimization. Instead of looking at the likelihood function, we look at the negative log-likelihood function. Then our large product becomes a sum over all observations of the negative logarithms of the conditional probabilities. Now we can look at a univariate Gaussian model. So now we are
in the one dimensional domain again and we can model this with a normal distribution, where we choose the output of our network ŷ as the expected value and one over beta as the variance, so beta plays the role of the precision. If we do so, we find the following formulation: p(y | x) = √(β/(2π)) · exp(−β(y − ŷ)²/2). Ok, so let's go ahead
and put this in our log-likelihood function. Remember, this is really something that you should know in the written exam. Everybody needs to know the normal distribution, and everybody needs to be able to convert this kind of univariate Gaussian distribution into the corresponding negative log-likelihood.
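Carrying out this conversion (a standard derivation under the Gaussian model just stated, with ŷᵢ the network output for sample i):

```latex
-\log \prod_{i=1}^{m} p(y_i \mid x_i)
  = -\sum_{i=1}^{m} \log\!\left[\sqrt{\tfrac{\beta}{2\pi}}\,
      \exp\!\left(-\tfrac{\beta}{2}(y_i - \hat{y}_i)^2\right)\right]
  = \frac{\beta}{2}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2
    \;-\; \frac{m}{2}\log\frac{\beta}{2\pi}
```

Dropping the constant term, minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors, i.e. the L2 loss mentioned in the video description.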
Open access · Duration: 00:15:11 · Recorded: 2020-10-10 · Uploaded: 2020-10-10 12:46:20 · Language: en-US
Deep Learning - Loss and Optimization Part 1
This video explains how to derive L2 Loss and Cross-Entropy Loss from statistical assumptions. Highly relevant for the oral exam!
Further Reading:
A gentle Introduction to Deep Learning